The following Tutorial is the final assessment of the project seminar “Species Distribution Modeling” at Philipps-University Marburg. In this tutorial we’re going to use the XGBoost algorithm to predict the specie´s distribution of butterflies in Pakistan and create a species richness map of the country. XGBoost (eXtreme Gradient Boosting) is a popular machine learning algorithm that belongs to the family of gradient boosting methods. It was developed by Tianqi Chen. and uses a combination of gradient boosting, decision trees, regularization, gradient-based optimization, feature importance analysis, parallelization. All this make’s it a robust and powerful algorithm that often delivers state-of-the-art results in various machine learning tasks. You will be introduced to the basic concepts of XGBoost and you’ll be provided with a reproducible workflow to use XGBoost to build classification models.
XGBoost is a ensemble Method same as Random Forrest, this means it combines the output of multiple Trees. But the methods differ in the way the idividual Trees are build and how the results are combined. In Xgboost the Output oft the Trees aren’t combined equally. Instead XGBoost uses a method called boosting. Boosting combines weak learner (small trees) sequentually so that the new tree corrects the errors oft he previous one. To understand this we have to look into some mathematical details. But dont worry when using XGBoost these details will be automated. Nevertheless its importand to understand these processes to optimize the algorithm later on.
As said XGBoost builds on many concepts to deliver it’s often outstanding results. We’re going to start with the mathematical base concepts of how XGBoost builds trees.
In this assessment were trying to classify geo-points if they’re potential habitats for a number of butterfly species. In order to do that we need XGBoost to build classification trees. XGBoost work’s also with regression but the process differs a little, so were not going to focus on that.
How XGBoost builds trees is limited by multiple regularization parameters:
We’ve heard of Lambda when we’re calculated the similarity-score. XGBoost default value for Lambda is 0 therefore we’ve been ignoring it. But when Lambda is set to >0 the similarity-score get’s smaller because the denominator becomes larger. Thus Lambda prevents over-fitting.
Another regularization parameter is the Cover or min_child_weight. This parameter is also the reason why we haven’t continued building our example tree. In XGBoost the default value for the cover is 1 wich means that every leaf with a cover value less than 1 doesn’t get build. The cover for classification tree’s is calculated by summing the previous probability times 1 minus the previous probability, for each residual in the leaf.
Similar to Cover (min_child_weigth) Gamma is a regularization parameter that causes XGBoost to only build new leafs when the Gain-Value is larger than Gamma.
Therefore it prevents overfitting the trees to our data. Gamma is a highly specialized regularization parameter, what mean’s that there is no “good” value. By default it’s 0 therefore no regularization takes place.
XGBoost substract’s gamma from the Gain-Value and then removes the leaf if the result is a negative number. For example if we take the previous calculated Gain-Value of our example tree of 3.8 a gamma-value of 4 would prune the whole tree down to the root. But if the gamma-value is just 0 XGBoost can build extremly large trees thus overfitting the trees to the dataset and raising the computation time a lot.
XGBoost is available for different programming and scripting languages, include Python, R, Java and C++. Docuementation is available here: https://xgboost.readthedocs.io/en/stable/
Before you dive into the code you need to install some packages this script will use:
install.packages("terra")
install.packages("ggplot2")
install.packages("fastDummies")
install.packages("tidyterra")
The XGBoost package can by installed in two different ways: * First there is the default package from CRAN, which will do it in most situations.
install.packages("xgboost")
TODO: For Benchmarks of a High-End CPU vs Low-End GPU, see https://medium.com/data-design/xgboost-gpu-performance-on-low-end-gpu-vs-high-end-cpu-a7bc5fcd425b. Risks? Reproducability?
##
## !! Installation on windows failed with "Warning in system("sh ./configure.win") 'sh' not found", for dirty Solution see section Troubleshooting
##
# Install dependencies
install.packages(c("data.table", "jsonlite"))
# Install XGBoost
system(paste("R CMD INSTALL ", getwd(), "/xgboost_r_gpu_win64_21d95f3d8f23873a76f8afaad0fee5fa3e00eafe.tar.gz", sep=""))
Don’t forget to enable CPU acceleration in the knitting paramters.
After installing all needed libaries you need to load them:
require(dplyr) # easy dataframe modification
require(ggplot2) # plotting
require(geodata) # downloading geospatial world dataset made easy
require(sf) # simple geospatial features
require(terra) # raster manipulation
require(tidyterra) # plot terra objects with ggplot
require(fastDummies)# create binary factor columns from character column
require(xgboost) # our modeling libary
Let’s begin with preparing the data used to train the model. Start with getting a overview of the provided data:
species_occurrences_all <- read.table("data/PakistanLadakh.csv", sep=",", header=TRUE)
species_occurrences_all <- sf::st_as_sf(species_occurrences_all, coords=c("x", "y"), remove=TRUE, crs=sf::st_crs("epsg:4326"))
str(species_occurrences_all)
## Classes 'sf' and 'data.frame': 5870 obs. of 2 variables:
## $ species : chr "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" ...
## $ geometry:sfc_POINT of length 5870; first list element: 'XY' num 73.1 34.4
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
## ..- attr(*, "names")= chr "species"
ggplot() +
geom_sf(data = sf::st_as_sf(species_occurrences_all), mapping=aes(color=species), show.legend = FALSE) +
ggtitle("Oberserved occurrence of butterflies in Pakistan")
Next we need some environmental data to train the model. Therefor we selected the bioclim data, which are widely used in speceis distribution modeling (Source: https://isprs-archives.copernicus.org/articles/XLII-4-W19/449/2019/isprs-archives-XLII-4-W19-449-2019.pdf#:~:text=The%20earliest%20studies%20of%20SDM%20used%20BIOCLIM%20-,requires%20species%20occurrence%20data%20%28latitude%2C%20longitude%2C%20and%20elevation%29.). The Bioclim layers are missing elevation data, we will use those too, since temperature is dependent on elevation. The Border of Pakistan is also needed, we will crop our data with that, so the model doesn’t train areas we don’t not have presence points. The elevation data is alread cropped so we don’t need to repeat this.
# political border of pakistan
border_pak <- geodata::gadm(country='PAK', level = 0, path='./data')
ggplot() +
geom_sf(data = sf::st_as_sf(border_pak), fill=NA)
# bioclim data from pakistan
bioclim_pak <- geodata::worldclim_country(country = "PAK", res = 10, var = "bio", path = "data/", version = "2.1")
names(bioclim_pak) <- substr(names(bioclim_pak), 11, 20) # TODO: rename to mare meaningful names or show table of layers in text
bioclim_pak <- terra::mask(bioclim_pak, border_pak)
ggplot() +
geom_spatraster(data = bioclim_pak) +
facet_wrap(~lyr) +
geom_sf(data = sf::st_as_sf(border_pak), fill = NA, show.legend = FALSE) +
ggtitle("Bioclim data of Pakistan")
# elevation data form pakistan
elevation_pak <- geodata::elevation_30s(country = 'PAK', path = 'data/')
ggplot() +
geom_spatraster(data = elevation_pak) +
geom_sf(data = sf::st_as_sf(border_pak), fill = NA, show.legend = FALSE) +
scale_fill_hypso_c(name = "Elevation")
So lets define our species_occurrences as presence points :
species_presence <- species_occurrences_all
rm(species_occurrences_all)
As absence points we will user random Points in Pakistan and combine them with the presence points. The absence points will be extended by a column “species”, which matches the column “species”´in the presence points.
# Generate random points inside pakistan as background points and extend them with a column for species = NA
# TODO: why 1000 points?? give a explanation for the decision
border_pak <- sf::st_as_sf(border_pak)
species_absence <- sf::st_sample(border_pak, size = 1000)
# TODO: add col for occurrence = 1 or occurrence = 0 here, dummycols can then be thrown away
# adding col species = NA to the background points, needed for rbind to join the data
species_absence <- cbind(species_absence, data.frame(species = as.character(NA)))
species_absence <- sf::st_as_sf(species_absence)
# Combine presence and absence (background) points into a single object
modeling_data_ <- rbind(species_presence, species_absence)
# Only points inside Pakistan should be used for modeling, also remove the columns added by the intersection.
modeling_data_ <- sf::st_intersection(modeling_data_, border_pak) %>% select(-COUNTRY, -GID_0)
Now we got our presence and absence points as spatial data. Finally we will extract values from our environmental data and add those to our modeling data, so xgboost can use this table to train its model.
# Extract values from bioclim and elevation, join them to our modeling_data
extraction_bioclim_pak <- terra::extract(bioclim_pak, modeling_data_, bind=FALSE, ID=FALSE)
extraction_elevation_pak <- terra::extract(elevation_pak, modeling_data_, bind=FALSE, ID=FALSE)
modeling_data_extracted <- cbind( modeling_data_, extraction_bioclim_pak, extraction_elevation_pak)
Clean up of no longer needed variables and check the final modeling data:
# create a final data variable and clean up variables
modeling_data <- modeling_data_extracted
rm(species_occurrences_all); rm(species_presence); rm(species_absence); rm(modeling_data_); rm(extraction_bioclim_pak); rm(extraction_elevation_pak); rm(modeling_data_extracted)
## Warning in rm(species_occurrences_all): Objekt 'species_occurrences_all' nicht
## gefunden
str(modeling_data)
ggplot() +
geom_sf(data = sf::st_as_sf(border_pak), fill=NA, show.legend=FALSE) +
geom_sf(data = sf::st_as_sf(modeling_data), mapping=aes(color=species), show.legend = FALSE) +
ggtitle("Oberserved occurrence of butterflies in Pakistan plus background points")
## Classes 'sf' and 'data.frame': 1058 obs. of 22 variables:
## $ species : chr "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" "Aglais_caschmirensis" ...
## $ bio_1 : num 17.7 12.25 15.35 16.44 -3.51 ...
## $ bio_2 : num 12.12 10.44 11.33 11.86 9.16 ...
## $ bio_3 : num 37.9 36.5 37.7 36.8 27.8 ...
## $ bio_4 : num 719 682 682 765 852 ...
## $ bio_5 : num 33.8 26.1 30.1 32.1 13.3 ...
## $ bio_6 : num 1.8 -2.5 0 -0.1 -19.6 ...
## $ bio_7 : num 32 28.6 30.1 32.2 32.9 ...
## $ bio_8 : num 24.7 19.2 22 11.2 -10.4 ...
## $ bio_9 : num 13.98 9.37 12.07 12.6 -1.78 ...
## $ bio_10 : num 26.02 20.15 23.17 25.47 6.92 ...
## $ bio_11 : num 8.3 3.37 6.37 6.7 -13.65 ...
## $ bio_12 : num 1358 1376 1005 699 707 ...
## $ bio_13 : num 275 239 169 117 122 80 74 75 116 107 ...
## $ bio_14 : num 28 26 22 16 18 14 13 11 20 18 ...
## $ bio_15 : num 68.9 57.8 55.9 55.8 57.2 ...
## $ bio_16 : num 628 585 413 276 319 213 197 196 308 280 ...
## $ bio_17 : num 119 133 99 74 82 49 45 45 91 70 ...
## $ bio_18 : num 619 572 406 210 107 63 53 57 110 116 ...
## $ bio_19 : num 251 259 195 139 205 101 91 97 210 144 ...
## $ PAK_elv_msk: num 1219 2315 1581 1628 4563 ...
## $ geometry :sfc_POINT of length 1058; first list element: 'XY' num 73.1 34.4
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr [1:21] "species" "bio_1" "bio_2" "bio_3" ...
species_nsamples = data.frame(modeling_data) %>%
count(species, sort=TRUE) %>%
rename(n_samples = n) %>%
filter(!is.na(species))
ggplot(species_nsamples, aes(n_samples)) +
geom_histogram(binwidth = 5) +
geom_vline(aes(xintercept=mean(n_samples)), linetype="dashed") +
annotate(x=mean(species_nsamples$n_samples), y=+Inf, label=paste("Mean:",round(mean(species_nsamples$n_samples),2)), vjust=3, geom="label") +
labs(x = "Number of Samples", y = "Number of Species")
rm(species_nsamples)
Starting with the training of our xgboost model we decided to do sperate training for every model. Despite that xgboost is capable of Multi-Classificiation. Therefor we defined a function ‘train’ wich invokes out data filtering and xgboost specific data preperation to meet the requirements of xgboost. Espacially converting the modeling data into a ‘xgb.DMatrix’ object. Finally we define some general parameters we want to use, e.g. enable of gpu acceleration by knit paramters. Last but not least we save the model to the disk to preserve it for modeling.
train <- function(
data, # training data, required
sp, # species name, required
xgb_params = list("nrounds" = 1000), # xgboost params, see https://xgboost.readthedocs.io/en/latest/parameter.html
save_model = TRUE
) {
# filter modeling data to current species, don't forget the absence points!
data <- modeling_data %>% filter(species == sp | is.na(species))
data <- dummy_columns(data, select_columns = "species",
ignore_na = TRUE,
remove_selected_columns = TRUE)
# make dummy_columns name persistent over all species
# replace NA values with 0
data <- data %>%
rename(species_occurrence = starts_with("species_")) %>%
mutate(species_occurrence = ifelse(is.na(species_occurrence), 0, species_occurrence)) %>%
mutate(species_occurrence = factor(species_occurrence))
# xgboost need a specific data format
data <- xgb.DMatrix(
data = as.matrix(data %>% select(-species_occurrence, -geometry)),
label = as.matrix(data %>% select(species_occurrence))
)
# enable gpu acceleration only if wanted
if(params$gpu_acc) {
xgb_params = c(xgb_params, predictor="gpu_predictor")
xgb_params = c(xgb_params, tree_method="gpu_hist")
xgb_params = c(xgb_params, sampling_method = "gradient_based")
}
#xgb_params = c(xgb_params, eval_metric="error") # Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
xgb_params = c(xgb_params, objective = "binary:logistic") # logistic regression for binary classification, output probability
model <- xgboost(data = data,
verbose = 0, # 0 (silent), 1 (warning), 2 (info), 3 (debug)
nrounds = xgb_params$nrounds,
params = xgb_params
)
message(paste("train_logloss:", mean(model[["evaluation_log"]][["train_logloss"]])))
if(save_model)
{
xgb.save(model, paste("out/", sp, ".model", sep = "")) # Max compatibility with future xgboost versions
save(model, file = paste("out/", sp, ".rds", sep = "")) # Fully restorable r object
}
return(model)
}
Additional training paramters are defined here in a own data frame. Advantage of this are the comparability and useability of the different parameter sets. The Table show the default parameter, one set taken from Roozbeh Valavi et. al. And last the parameters we will user for training. Those have been tested and tuned manually.
rm(xgboost_params)
## Warning in rm(xgboost_params): Objekt 'xgboost_params' nicht gefunden
xgboost_params <- data.frame("dataset" = character(),
"nrounds" = numeric(),
"eta" = numeric(),
"max_depth" = numeric(),
"subsample" = numeric(),
"gamma" = numeric(),
"alpha" = numeric(),
"lambda" = numeric(),
"colsample_bytree" = numeric(),
"min_child_weight" = numeric()
)
xgboost_params <- xgboost_params %>% add_row(dataset = "default", eta = 0.3, max_depth = 6, gamma = 0, alpha = 0, lambda = 1, subsample = 1, colsample_bytree = 1, min_child_weight = 1)
# Parameter taken from https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecm.1486
xgboost_params <- xgboost_params %>% add_row(dataset = "literature", nrounds = 1000, eta = 0.001, max_depth = 5, subsample = 0.75, gamma = 0, alpha = 0, lambda = 1, colsample_bytree = 0.8, min_child_weight = 1)
xgboost_params <- xgboost_params %>% add_row(dataset = "tuned 1", nrounds = 2000, eta = 0.01, max_depth = 5, subsample = 1, gamma = 0, alpha = 0, lambda = 1, colsample_bytree = 1, min_child_weight = 1)
xgboost_params <- xgboost_params %>% add_row(dataset = "tuned 2", nrounds = 5000, eta = 0.3, max_depth = 5, subsample = 1, gamma = 2, alpha = 0, lambda = 1, colsample_bytree = 1, min_child_weight = 1)
xgboost_params <- xgboost_params %>% add_row(dataset = "tuned 3", nrounds = 3000, eta = 0.15, max_depth = 7, subsample = 0.75, gamma = 1.5, alpha = 0, lambda = 0.5, colsample_bytree = 1, min_child_weight = 1)
print(xgboost_params)
## dataset nrounds eta max_depth subsample gamma alpha lambda
## 1 default NA 0.300 6 1.00 0.0 0 1.0
## 2 literature 1000 0.001 5 0.75 0.0 0 1.0
## 3 tuned 1 2000 0.010 5 1.00 0.0 0 1.0
## 4 tuned 2 5000 0.300 5 1.00 2.0 0 1.0
## 5 tuned 3 3000 0.150 7 0.75 1.5 0 0.5
## colsample_bytree min_child_weight
## 1 1.0 1
## 2 0.8 1
## 3 1.0 1
## 4 1.0 1
## 5 1.0 1
Next is to perpare data used to predic species occurrence in pakistan. Therefore we will use the raw raster data and predict of those with ‘terra::predict’ which allows us to pass on a ‘Spatraster’ object. XGBoost can’t handle Spatraster, so ‘terra:predict’ allows as to define a custom prediction function, which converts data into a matrix.
# gen stack from rasters bioclim_pak and elevation_pak
elev_pak = resample(elevation_pak, bioclim_pak)
ext(elevation_pak) <- ext(bioclim_pak)
prediction_rstack = c(bioclim_pak, elev_pak)
# Remove values outside pakistan, because otherwise the model will make predictions outside the modeling area
prediction_rstack = terra::mask(prediction_rstack, border_pak)
# We need to make a custom predict function for terra::predict() since xgboost didn't take a data.frame as input. See https://stackoverflow.com/questions/71947124/predict-xgboost-model-onto-raster-stack-yields-error
prediction_custom <- function(model, data, ...) {
stats::predict(model, newdata=as.matrix(data), ...)
}
predict <- function(model, prediction_data, plots=FALSE)
{
#model = xgb.load(paste("out/", sp, "/" ,sp, ".model", sep = ""))
#model = readRDS(paste("out/", sp, "/" ,sp, ".bin", sep = ""))
prediction = terra::predict(object=prediction_data,
model=model,
fun=prediction_custom
)
terra::writeRaster(prediction, paste("out/", sp, ".tif", sep = ""), overwrite=TRUE)
return(prediction)
}
Three different paramter sets used for Aglais_caschmirensis
sp = "Aglais_caschmirensis"
print(xgboost_params[2,])
model <- train(modeling_data, sp, as.list(xgboost_params[2,]), save_model = FALSE)
## train_logloss: 0.420687878569378
prediction <- predict(model, prediction_rstack)
ggplot() +
geom_spatraster(data = prediction) +
scale_fill_hypso_c(direction = -1,
limits=c(0,1),
name = "Prediction") +
geom_sf(data = modeling_data %>% filter(species == sp),
size = 1,
shape = 1 ) +
geom_sf(data = sf::st_as_sf(border_pak),
fill = NA, show.legend=FALSE) +
ggtitle(paste("Oberserved and predicted occurrence of", sp, "in pakistan \n", paste("train_logloss:", mean(model[["evaluation_log"]][["train_logloss"]])), "\n params_index: 2"))
## SpatRaster resampled to ncells = 500703
rm(sp)
## dataset nrounds eta max_depth subsample gamma alpha lambda
## 2 literature 1000 0.001 5 0.75 0 0 1
## colsample_bytree min_child_weight
## 2 0.8 1
## [21:17:33] WARNING: C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0a64b23488439600a-1\xgboost\xgboost-ci-windows\src\learner.cc:767:
## Parameters: { "dataset", "nrounds" } are not used.
sp = "Aglais_caschmirensis"
print(xgboost_params[4,])
model <- train(modeling_data, sp, as.list(xgboost_params[4,]), save_model = FALSE)
## train_logloss: 0.0462355725952563
prediction <- predict(model, prediction_rstack)
ggplot() +
geom_spatraster(data = prediction) +
scale_fill_hypso_c(direction = -1,
limits=c(0,1),
name = "Prediction") +
geom_sf(data = modeling_data %>% filter(species == sp),
size = 1,
shape = 1 ) +
geom_sf(data = sf::st_as_sf(border_pak),
fill = NA, show.legend=FALSE) +
ggtitle(paste("Oberserved and predicted occurrence of", sp, "in pakistan \n", paste("train_logloss:", mean(model[["evaluation_log"]][["train_logloss"]])), "\n params_index: 4"))
## SpatRaster resampled to ncells = 500703
rm(sp)
## dataset nrounds eta max_depth subsample gamma alpha lambda colsample_bytree
## 4 tuned 2 5000 0.3 5 1 2 0 1 1
## min_child_weight
## 4 1
## [21:17:54] WARNING: C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0a64b23488439600a-1\xgboost\xgboost-ci-windows\src\learner.cc:767:
## Parameters: { "dataset", "nrounds" } are not used.
sp = "Aglais_caschmirensis"
print(xgboost_params[5,])
model <- train(modeling_data, sp, as.list(xgboost_params[5,]), save_model = FALSE)
## train_logloss: 0.0216167349648276
prediction <- predict(model, prediction_rstack)
ggplot() +
geom_spatraster(data = prediction) +
scale_fill_hypso_c(direction = -1,
limits=c(0,1),
name = "Prediction") +
geom_sf(data = modeling_data %>% filter(species == sp),
size = 1,
shape = 1 ) +
geom_sf(data = sf::st_as_sf(border_pak),
fill = NA, show.legend=FALSE) +
ggtitle(paste("Oberserved and predicted occurrence of", sp, "in pakistan \n", paste("train_logloss:", mean(model[["evaluation_log"]][["train_logloss"]])), "\n params_index: 5"))
## SpatRaster resampled to ncells = 500703
rm(sp)
## dataset nrounds eta max_depth subsample gamma alpha lambda colsample_bytree
## 5 tuned 3 3000 0.15 7 0.75 1.5 0 0.5 1
## min_child_weight
## 5 1
## [21:18:56] WARNING: C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0a64b23488439600a-1\xgboost\xgboost-ci-windows\src\learner.cc:767:
## Parameters: { "dataset", "nrounds" } are not used.
if(plot) {
#xgb_model[["evaluation_log"]]
importance <- xgb.importance(model = model)
#xgb.plot.importance(importance_matrix = importance)
#xgb.ggplot.deepness(xgb_model)
#xgb.plot.multi.trees(mode = xgb_model, features_keep = 3)
#library("DiagrammeRsvg", "rsvg")
#gr <- xgb.plot.multi.trees(model=xgb_model, features_keep = 5, render=FALSE)
#DiagrammeR::export_graph(gr, 'tree.pdf', width=600, height=1500)
xgb.ggplot.shap.summary(data = as.matrix(modeling_data %>% select(-species_occurrence, -geometry)), model = model )
ggsave(paste(sp, "_shap.png", sep=""), path=paste("out/",sp, sep=""))
rm(importance)
}
Finally we combine all predicted species occurrence into a Map that
indicates how many species might occur in one pixel. First, we define a
threshold above which the prediction should be considered. The
prediction have been saved as tif in ‘out/
time_start <- proc.time()
i <- 0
species = (modeling_data %>% distinct(species) %>% filter(!is.na(species)))$species
# xgboost needs the output dir to exist before saving model
system("mk out")
for(sp in species)
{
time_start_species <- proc.time()
i <- i+1
message(paste("[", i, "/", length(species), "] ", sp, ": ", sep = ""), appendLF=F)
# training and prediction of the species model
model <- train(modeling_data, sp, as.list(xgboost_params[5,]))
prediction <- predict(model, prediction_rstack)
ggplot() +
geom_spatraster(data = prediction) +
scale_fill_hypso_c(direction = -1,
limits=c(0,1),
name = "Prediction") +
geom_sf(data = modeling_data %>% filter(species == sp),
size = 1,
shape = 1 ) +
geom_sf(data = sf::st_as_sf(border_pak),
fill = NA, show.legend=FALSE) +
ggtitle(paste("Oberserved and predicted occurrence of", sp, "in pakistan \n", paste("train_logloss:", mean(model[["evaluation_log"]][["train_logloss"]]))))
ggsave(paste(sp, ".png", sep=""), path="out")
message(paste("Exceution took", proc.time() - time_start_species, "seconds"))
}
## [1/2] Acraea_issoria: train_logloss: 0.00718968736489119
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageExceution took 254.87 secondsExceution took 12.95 secondsExceution took 28.38 secondsExceution took NA secondsExceution took NA seconds
## [2/2] Aglais_caschmirensis: train_logloss: 0.0206861109287276
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageExceution took 362.95 secondsExceution took 17.81 secondsExceution took 38.18 secondsExceution took NA secondsExceution took NA seconds
message(paste("Total Exceution took", proc.time() - time_start, "seconds"))
## Total Exceution took 617.92 secondsTotal Exceution took 30.76 secondsTotal Exceution took 66.65 secondsTotal Exceution took NA secondsTotal Exceution took NA seconds
## [1] 127
## [21:19:36] WARNING: C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0a64b23488439600a-1\xgboost\xgboost-ci-windows\src\learner.cc:767:
## Parameters: { "dataset", "nrounds" } are not used.
##
## [21:20:04] WARNING: C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0a64b23488439600a-1\xgboost\xgboost-ci-windows\src\learner.cc:767:
## Parameters: { "dataset", "nrounds" } are not used.
# Step 1 generate Raster Stack
# l_species will consist of all 421 species prediction rasters -> RAM usage will be insane ~ 25GB
l_species = list()
threshold = 0.5
# reclassify raster:
# value < threshold = 0
# value > threshold = 1
m <- c(0, threshold, 0,
threshold, 1, 1)
m <- matrix(m, ncol=3, byrow=TRUE)
for(sp in (modeling_data %>% distinct(species) %>% filter(!is.na(species)))$species)
{
# get species raster from file system
r <- rast(paste("./out/",sp,".tif", sep = "" ))
l_species[sp] <- terra::classify(r, m, include.lowest = TRUE)
rm(r)
}
## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated
## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated
stack = terra::rast(l_species)
stack = sum(stack)
stack = terra::mask(stack, border_pak)
ggplot() +
geom_spatraster(data = stack) +
scale_fill_hypso_c(palette="spain", name="N° of species" ) +
#scale_fill_hypso_b(name="N° of species") +
geom_sf(data = sf::st_as_sf(border_pak), fill=NA, show.legend=FALSE) +
ggtitle("Species richness of butterflies in Pakistan")
## SpatRaster resampled to ncells = 500703
ggsave("SpeciesRichnessMap.png", path="out")
## Saving 7 x 5 in image
For how many species is no prediction made? -> count(max(raster) = 0)
Installing XGBoost with GPU accerlation
Pre-built binary packages are offered by XGBoost after someone makes
a request on GitHub (https://github.com/dmlc/xgboost/issues/6654). Since the
packages are precompiled, it should not be necessary to compile them
from source. The installation still fails with error
rknitr::inline_expr(“Warning in system(”sh
./configure.win”) ‘sh’ not found”)`. So there are two ways to fix this
problem: - Install R-Tools and build from source - Copy src/xgboost.dll
from archive (https://github.com/dmlc/xgboost/releases) into your r
library manually e.g C:%USERNAME%-library\4.2
Sessioninfo
sessionInfo()
## R version 4.2.3 (2023-03-15 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] xgboost_1.7.5.1 fastDummies_1.6.3 tidyterra_0.4.0 sf_1.0-12
## [5] geodata_0.5-8 terra_1.7-29 ggplot2_3.4.2 dplyr_1.1.2
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 xfun_0.39 bslib_0.4.2 purrr_1.0.1
## [5] lattice_0.20-45 colorspace_2.1-0 vctrs_0.6.2 generics_0.1.3
## [9] htmltools_0.5.5 s2_1.1.3 yaml_2.3.7 utf8_1.2.3
## [13] rlang_1.1.0 e1071_1.7-13 jquerylib_0.1.4 pillar_1.9.0
## [17] glue_1.6.2 withr_2.5.0 DBI_1.1.3 wk_0.7.3
## [21] lifecycle_1.0.3 stringr_1.5.0 munsell_0.5.0 gtable_0.3.3
## [25] ragg_1.2.5 codetools_0.2-19 evaluate_0.21 labeling_0.4.2
## [29] knitr_1.42 fastmap_1.1.1 class_7.3-21 fansi_1.0.4
## [33] highr_0.10 Rcpp_1.0.10 KernSmooth_2.23-20 scales_1.2.1
## [37] classInt_0.4-9 cachem_1.0.7 jsonlite_1.8.5 systemfonts_1.0.4
## [41] farver_2.1.1 textshaping_0.3.6 digest_0.6.31 stringi_1.7.12
## [45] grid_4.2.3 cli_3.6.1 tools_4.2.3 magrittr_2.0.3
## [49] sass_0.4.6 proxy_0.4-27 tibble_3.2.1 tidyr_1.3.0
## [53] pkgconfig_2.0.3 Matrix_1.5-3 data.table_1.14.8 rmarkdown_2.21
## [57] rstudioapi_0.14 R6_2.5.1 units_0.8-1 compiler_4.2.3